Comments

  1. Output from skim() is not being displayed because you have set results = "hide" me: clean up viz

  2. Incorrect use of asterisk. You need to close them to get desired effect (italic, bold)
  3. Avoid printing out entire vectors since their outputs will take up a lot of space (unique(flights$arr_delay)). Moreover, a bunch of numbers on the screen does give the reader any further insights.
  4. Consider removing groups with low number of observations from consideration since they could skew your results
  5. General convention - explainatory variable is on the x-axis, variable being explained on y. In our case, y - axis should be the delay variables. You have this reversed for most of your charts. Flip this around.
  6. Based on your analysis there seems to be a relationship between distance and delays with shorter flights being more delayed. Why is this? Explore this further.
  7. Consider cutting continous variables like distance in intervals (look at cut_*() set of function under ggplot)
  8. Explore your explainatory variable more thoroughly before the analysis
  9. Create a single dataset for weather + flights and use it for all the weather based analysis instead of creating separate ones each time.
  10. Use snake_case for naming your variables.
  11. Explain what is happening in each chart and your logic for trying something out. For instance, you are interested in precip vs delay because xyz.
  12. Another way to look at this would be to consider extreme values of delay and see the relationship between that and the explainatory variable.
  13. Consider tabsets # {.tabset .tabset-fade .tabset-pills}
  14. Take logs when you have a skewed variable (that spans a large range) (check out scale_x_log10())
  15. Make things more attractive

Due Date: Sunday, April 28th

Keyboard Shortcuts:

  1. Assignment operator (<-) is Alt+- for Windows and Option+- for Mac
  2. Insert new code chunk Ctrl+Alt+I for Windows and Command+Option+I for Mac
  3. Run a line of code Ctrl+Enter for Windows and Command+Enter for Mac
  4. Run all the code within a chunk Ctrl+Shift+Enter for Windows and Command+Shift+Enter for Mac
  5. Insert a pipe operator (%>%) Ctrl+Shift+M for Windows and Command+Shift+M for Mac

General Instructions

  1. Create a new notebook (copy parts of the YAML header you are interested in keeping) to work on your assignment.
  2. The assignment should be submitted via a shareable github link that points to your notebook (.nb.html) file to me on Slack.
  3. Make sure the notebook is neatly formatted, and that all the charts and tables and look good when you preview it.

Task: Explore at least 5 different causes for departure and arrival delays (combined - not 5 each). Its not necessary that you find a definitive answer for what causes a delay, instead I want you to showcase your ability to explore the data with a combination of analytical text, tables and charts. The final output should be something you can show a prospective employer or client as proof of you ability to work with data in R.

## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0       v purrr   0.3.1  
## v tibble  2.0.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Warning: package 'nycflights13' was built under R version 3.5.3
## Warning: package 'skimr' was built under R version 3.5.3
## 
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
## 
##     filter

Let’s talk game plan

I want to try: distance, month, dest, precip, temp, visib for all of these, I may want to split by origin. My graph options include histogram, geom_point (scatter), geom_jitter, geom_smooth(method=lm), geom_line for line chart, geom_col for discrete x continuous y, geom_boxplot, geom_count and for one var, geom_hist, geom_density, geom_dotplot.

** Important notes you have never got re summarise. Group by the “background” vars you want to see compared to your dependent var. use the summarize command to perform functions on the dependent var. Don’t group by the dependent var**

*Section One First hypothesis: Distance and arrival delay, let’s see if there is a relationship. REMAINDERS: Charts A-3 and A-5 have wasted space in some of the origin facets. REMAINDERS: What we found would make it interesting to check dep_delay and Distance.

A-1: Let’s look at the data.

  unique(flights$arr_delay)
##   [1]   11   20   33  -18  -25   12   19  -14   -8    8   -2   -3    7   31
##  [15]   -4   -7   -6   16  -12  -17   32   14    4  -21   -9    3    5    1
##  [29]   29   10    0  -19   48   -5  -10  -11  -33   27  -23    2  -31   44
##  [43]  -15   21  -26   -1   30   26   49  -30  -22  -13  -28  137   23   17
##  [57]  -16  -40   24   51   15  -24  -39  -27  851    9    6   50   40  -29
##  [71]   28   13   42   39  123  145   78   34   22   38   43  -38   36   81
##  [85]   53   93   56   37   73   18   46   60  103   84   66   65   83  127
##  [99]   52  125   NA   47   35  115   91  136   67   72  -35   80   59   96
## [113]  -34   25   41  -32   68   61  107  138  116  338   94  263  -47  -48
## [127]   54  151  166  174   75  222   45   55  250  142  246  191   69  456
## [141]  -20  154  -52  -44  -59   90  -41  -36  -45  171   58   70  -57   57
## [155]  173   62   99   97   85   63   98  102  -55  207  -42  288  323  119
## [169]  -37  130  109  368  122   82  131   71  141  120  106  -51   77  158
## [183]  180   64  176   88  182  157  359   89  252  110  100  111  101  172
## [197]  143  -43   92  -46  167  133  -65  -63  113   79  128  -49  -61  175
## [211]  -53  -56  -60  270  257  105  285  150  -50  -70  126  161  162  118
## [225]  132   87  276  156  112  248  308  213  160   95  155   74  121   86
## [239]  108  117  -54  168  144  184  147 1272  140   76  124  247 1109  178
## [253]  394  292  348  181  200  245  230  149  177  297  189  215  135  612
## [267]  146  185  134  211  114  228  190  104  243  203  272  -64  328  187
## [281]  258  159  206  497  212  199  139  129  193  235  169  299  271  238
## [295]  221  237  232  265  231  163  164  281  186  153  264  152  192  165
## [309]  236  256  196  262  198  255  201  223  233  183  486  194  205  214
## [323]  282  210  273  340  218  370  330  275  226  188  179  249  241  209
## [337]  364  325  225  267  240  239  197  266  229  278  351  -58  -62  219
## [351]  294  148  217  224  253  220  208  227  335  204  268  195  311  170
## [365]  360  216  301  366  202  312  284  688  314  244  350  374  279  381
## [379]  378  291  417  300  318  796  434  345  344  -67  274  302  385  234
## [393]  269  242  313  309  389  346  329  277  321  290  304  614  296  324
## [407]  280  396  334  681  337  283  326  878  377  357  354  261  443  356
## [421]  260  538  422  361  336  307  376  259  254  322  856  386  298  319
## [435]  648  295  395  493  846  251  372  433  847  362  353  469  310  390
## [449]  -68  331  397  384  834  327  595  744  767  289  773  -69  -66  387
## [463]  410  305  375  317  436  415  391  293  380  915  343  316  402  303
## [477]  784  506  315  383  306  332  560  399  373  616  931  347  516  428
## [491]  349  333  286  341  435  783  821  405  408  -73  -75  875  369  -74
## [505]  -71  -86  398  453  499  495  852  -79  287  461  407  379  448  598
## [519]  352  451  339  420  421  430  468  496  342 1127  411  489  367  444
## [533]  474  424  769  780  802  850  400  320  431  363  490  485  401  419
## [547]  561  382  676  632  404  445  414  441  475  412  551  371  355  645
## [561]  463  989  406  895  393  458  455  577  454  449  358  403  674  505
## [575]  492  572  571 1007
  unique(flights$distance)
##   [1] 1400 1416 1089 1576  762  719 1065  229  944  733 1028 1005 2475 2565
##  [15] 1389  187 2227 1076 1023 1020  502 1085  760 2586 1074 1598  746 2133
##  [29] 1096 1008 2153 2454  185  529  212  950  301 1620 2434 2248 1182 1035
##  [43] 1990 1147  997 1069 1068 1608  738 2402 1372 1605  264 1080  937  209
##  [57] 1010 1041 2422  544  427  200  541  479  820 1029  628  946  340 2446
##  [71]  213  419  888  828  740  550  431  509  184 1874  416  488 2425  711
##  [85] 4983  764 1521  266   96 1623 1634  404  228 1726  199  725 1047  583
##  [99]  273  872  282  645  284  214  708  397  195  463  246  169 1428 1183
## [113] 2465  533 1411  143  116 1969  319 4963  799  445  963  569  425 1504
## [127]  277  665  292  483  594  605 1092  765 1391  589  335  748   94 1107
## [141] 1626 1747  290 1569  642 2569 1134  636  444 2576 2521 1585 1215 1587
## [155]  631 1325  160 1017 1617  585 1131  618  254  964  647  866  269 1167
## [169]  602  288   80 1728 1882 1795 1207 2378  461  833  258  641  722  659
## [183]  660  549  378  812  296 1417 1113 1148 1031  610 1826  500  198  173
## [197]  651  892  604  637 1725 1746  617 1894  662  305  745  599  865  655
## [211]  563  644 3370   17
flights %>% 
  group_by(distance, origin) %>% 
  summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>% 
  arrange(distance)

Nothing too obvious here. A-2: Let’s check out distance outliers.

flights %>% 
  ggplot(aes(x = distance)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are some.

A-3 - Let’s facet by origin for our first major two-variable graph.

flights %>% 
  group_by(distance, origin) %>% 
  summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>% 
  arrange(distance) %>% 
  ggplot(aes(y = meandelayQ101, x = distance)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~ origin, scales = "free_x") ##hari made this change on scales
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

flights
flights %>% 
    filter(!is.na(distance) & !is.na(air_time)) %>% 
    group_by(distance) %>% 
    summarise(meanSpeed = mean(distance, na.rm = T)/mean(air_time, na.rm = T))
flights %>% 
    filter(!is.na(tailnum)) %>% 
    group_by(month, day, tailnum) %>% 
    summarise(nLegs = n(),
              meanDistance = mean(distance, na.rm = T)) %>% 
    arrange(desc(nLegs))

It looks like distance falls as arrival delay increases, except at LaGuardia. A little weird. The segmentation of origin I did is a bit random.

A-4: Does the relationship hold across the combined airports?

flights %>% 
  group_by(distance, origin) %>% 
  summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>% 
  arrange(distance) %>% 
  ggplot(aes(x = meandelayQ101, y = distance)) +
  geom_point() +
  geom_smooth(method = lm)
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

It does. A-5: Let’s try minus outlier distances.

flights %>% 
  group_by(distance, origin) %>% 
  filter(distance < 3000) %>% 
  summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>% 
  arrange(distance) %>% 
  ggplot(aes(x = distance, y = meandelayQ101)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~ origin)
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

Result: still a negative relationship between distance and delay. weird. Mean delay is highest for very short flights. A-6: Let’s remove the origin faceting and check one more time.

flights %>% 
  group_by(distance, origin) %>% 
  filter(distance < 3000) %>% 
  summarise(meandelayQ101 = mean(arr_delay, na.rm = T)) %>% 
  arrange(distance) %>% 
  ggplot(aes(x = distance, y = meandelayQ101)) +
  geom_point() +
  geom_smooth(method = lm) 
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

There is a trend here.

*SECTION TWO Hypothesis Two: there is a relationship between Precip in weather and Departure Delay. (Spends a full hour understanding joins again) B-1: Let’s make our dataset. Right now, Precip and flights are separated.

weather
Preciptest <- flights %>%
  select(time_hour, origin, dep_delay) %>% 
  left_join(
    x = .,
    y = weather,
    by = c("origin", "time_hour"))

Ok, we have a dataset. Let’s look at precip.

Never finished this

Preciptest 
Preciptest %>% 
  group_by(precip, origin) %>% 
  summarise(meandepdelayQ201 = mean(dep_delay, na.rm = T)) %>% 
  arrange(meandepdelayQ201) %>% 
  ggplot(aes(meandepdelayQ201, y = precip)) +
  geom_point() +
  geom_smooth(method = lm)
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

Looks like a bingo. Let’s wrap it by origin.

Preciptest %>% 
  group_by(precip, origin) %>% 
  summarise(meandepdelayQ201 = mean(dep_delay, na.rm = T)) %>% 
  arrange(meandepdelayQ201) %>% 
  ggplot(aes(meandepdelayQ201, y = precip)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~ origin)
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

What else can we facet wrap?

Preciptest <- flights %>%
  select(time_hour, carrier, origin, dep_delay) %>% 
  left_join(
    x = .,
    y = weather,
    by = c("origin", "time_hour"))

Preciptest %>% 
  group_by(precip, carrier) %>% 
  summarise(meandepdelayQ201 = mean(dep_delay, na.rm = T)) %>% 
  arrange(meandepdelayQ201) %>% 
  ggplot(aes(meandepdelayQ201, y = precip)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_wrap(~ carrier)
## Warning: Removed 31 rows containing non-finite values (stat_smooth).
## Warning: Removed 31 rows containing missing values (geom_point).

Interesting and a bit hard to parse, but the relationship certainly varies by carrier.

#geom_

SECTION THREEE ##Exploring Visib and arrival delay

get the joins going

library(tidyverse)
library(nycflights13)
DataSet4 <- flights %>% 
  select(time_hour, origin, arr_delay) %>% 
  left_join(weather, by = c("origin", "time_hour"))

Remind myself what flights looks like.

str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

Check our combined dataset to make sure it looks right

str(DataSet4)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
##  $ origin    : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ arr_delay : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ year      : num  2013 2013 2013 2013 2013 ...
##  $ month     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour      : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ temp      : num  39 39.9 39 39 39.9 ...
##  $ dewp      : num  28 25 27 27 25 ...
##  $ humid     : num  64.4 54.8 61.6 61.6 54.8 ...
##  $ wind_dir  : num  260 250 260 260 260 260 240 260 260 260 ...
##  $ wind_speed: num  12.7 15 15 15 16.1 ...
##  $ wind_gust : num  NA 21.9 NA NA 23 ...
##  $ precip    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pressure  : num  1012 1011 1012 1012 1012 ...
##  $ visib     : num  10 10 10 10 10 10 10 10 10 10 ...

Let’s look at Visib

DataSet4 %>% 
  ggplot(aes(x = visib)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1556 rows containing non-finite values (stat_bin).

Hmm, looks discrete

DataSet4$visib %>% 
  sort() %>% 
  unique()
##  [1]  0.00  0.06  0.12  0.25  0.50  0.75  1.00  1.25  1.50  1.75  2.00
## [12]  2.50  3.00  4.00  5.00  6.00  7.00  8.00  9.00 10.00

After Trial and Error, arrive at this, which is ugly and needs improvement

DataSet4 %>% 
  ggplot(aes(x = visib, y = arr_delay)) +
  geom_point() +
  geom_smooth(method = lm)
## Warning: Removed 10957 rows containing non-finite values (stat_smooth).
## Warning: Removed 10957 rows containing missing values (geom_point).

So the LM line appears to visually disagree with the superficial impression of the points. Are outliers affecting this? Let’s check sample sizes.

DataSet4 %>% 
  group_by(visib) %>% 
  summarise(nObs = n())

Sizes are acceptable. I’m not sure what to think about this relationship. The scale of the graph could be affecting the line. Let’s try a box plot. First, i think I need to purge the NAs. First, how many rows are there before I start.

DataSet4 %>% 
  nrow()
## [1] 336776

Ok, let’s try something re NA purge.

DataSet4 %>% 
  filter(!is.na(visib))

This is what I have in mind, but it doesn’t work yet.

DataSet4 %>% 
  group_by(visib) %>% 
ggplot(aes(x = as.factor(visib), y = arr_delay), na.rm = T ) +
  geom_boxplot()
## Warning: Removed 9430 rows containing non-finite values (stat_boxplot).

Okay, this kinda works! You could sort of read a trend into this, but it seems to decline around “1”

 DataSet4 <- DataSet4 %>%
  filter(visib > 0.11)

DataSet4 %>% 
  ggplot(aes(x = visib, y = arr_delay)) +
  geom_point() +
  geom_smooth(method = lm)
## Warning: Removed 9369 rows containing non-finite values (stat_smooth).
## Warning: Removed 9369 rows containing missing values (geom_point).

A clearer relationship now, via cheating :-D